Goiânia
TRAJECT-Bench: A Trajectory-Aware Benchmark for Evaluating Agentic Tool Use
He, Pengfei, Dai, Zhenwei, He, Bing, Liu, Hui, Tang, Xianfeng, Lu, Hanqing, Li, Juanhui, Ding, Jiayuan, Mukherjee, Subhabrata, Wang, Suhang, Xing, Yue, Tang, Jiliang, Dumoulin, Benoit
Large language model (LLM)-based agents increasingly rely on tool use to complete real-world tasks. While existing work evaluates LLMs' tool-use capability, it largely focuses on final answers and overlooks the detailed tool-usage trajectory, i.e., whether tools are selected, parameterized, and ordered correctly. We introduce TRAJECT-Bench, a trajectory-aware benchmark to comprehensively evaluate LLMs' tool-use capability through diverse tasks with fine-grained evaluation metrics. TRAJECT-Bench pairs high-fidelity, executable tools across practical domains with tasks grounded in production-style APIs, and synthesizes trajectories that vary in breadth (parallel calls) and depth (interdependent chains). Besides final accuracy, TRAJECT-Bench also reports trajectory-level diagnostics, including tool selection, argument correctness, and dependency/order satisfaction. Analyses reveal failure modes such as similar-tool confusion and parameter-blind selection, as well as scaling behavior with tool diversity and trajectory length, including a bottleneck in the transition from short to mid-length trajectories, offering actionable guidance for LLMs' tool use.
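The trajectory-level diagnostics named in this abstract (tool selection, argument correctness, order satisfaction) can be illustrated with a minimal sketch. The `ToolCall` schema, the helper `call`, the metric definitions, and the example tool names are all assumptions made for illustration; they are not the benchmark's actual data format or scoring code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One trajectory step (hypothetical schema): tool name plus arguments."""
    tool: str
    args: tuple = ()  # sorted (key, value) pairs, so calls are hashable


def call(tool, **kwargs):
    """Build a ToolCall from keyword arguments."""
    return ToolCall(tool, tuple(sorted(kwargs.items())))


def trajectory_diagnostics(predicted, reference):
    """Compare a predicted trajectory against a reference one.

    Assumed, simplified metric definitions:
      tool_recall   - share of reference tools that the prediction selected
      arg_accuracy  - among selected tools, share whose arguments match exactly
      order_ok      - selected tools keep the reference's relative order
    Repeated uses of the same tool are matched by first occurrence only.
    """
    pred_tools = [c.tool for c in predicted]
    ref_tools = [c.tool for c in reference]

    selected = [t for t in ref_tools if t in pred_tools]
    tool_recall = len(selected) / len(ref_tools) if ref_tools else 1.0

    pred_set = set(predicted)
    matched = [c for c in reference if c.tool in pred_tools]
    arg_accuracy = (
        sum(c in pred_set for c in matched) / len(matched) if matched else 0.0
    )

    positions = [pred_tools.index(t) for t in selected]
    order_ok = positions == sorted(positions)

    return {
        "tool_recall": tool_recall,
        "arg_accuracy": arg_accuracy,
        "order_ok": order_ok,
    }


# Hypothetical travel task: right tools, one wrong argument, wrong order.
reference = [
    call("search_flights", origin="GYN", dest="VIE"),
    call("book_hotel", city="Vienna"),
]
predicted = [
    call("book_hotel", city="Vienna"),
    call("search_flights", origin="GYN", dest="GRU"),
]
report = trajectory_diagnostics(predicted, reference)
print(report)
```

In this example a final-answer check alone could not tell whether the failure came from tool choice, parameters, or ordering, which is the gap the abstract describes.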
- Europe > Austria > Vienna (0.14)
- Asia > Philippines > Luzon > National Capital Region > City of Manila (0.14)
- Europe > France (0.04)
- (35 more...)
- Leisure & Entertainment (1.00)
- Consumer Products & Services > Travel (1.00)
- Media > Music (0.96)
- (2 more...)
Semi-automated Fact-checking in Portuguese: Corpora Enrichment using Retrieval with Claim extraction
Gomes, Juliana Resplande Sant'anna, Filho, Arlindo Rodrigues Galvão
The accelerated dissemination of disinformation often outpaces the capacity for manual fact-checking, highlighting the urgent need for Semi-Automated Fact-Checking (SAFC) systems. Within the Portuguese language context, there is a noted scarcity of publicly available datasets (corpora) that integrate external evidence, an essential component for developing robust AFC systems, as many existing resources focus solely on classification based on intrinsic text features. This dissertation addresses this gap by developing, applying, and analyzing a methodology to enrich Portuguese news corpora (Fake.Br, COVID19.BR, MuMiN-PT) with external evidence. The approach simulates a user's verification process, employing Large Language Models (LLMs, specifically Gemini 1.5 Flash) to extract the main claim from texts and search engine APIs (Google Search API, Google FactCheck Claims Search API) to retrieve relevant external documents (evidence). Additionally, a data validation and pre-processing framework, including near-duplicate detection, is introduced to enhance the quality of the base corpora. The main results demonstrate the methodology's viability, providing enriched corpora and analyses that confirm the utility of claim extraction, the influence of original data characteristics on the process, and the positive impact of enrichment on the performance of classification models (BERTimbau and Gemini 1.5 Flash), especially with fine-tuning. This work contributes valuable resources and insights for advancing SAFC in Portuguese.
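The abstract mentions near-duplicate detection as part of corpus pre-processing without naming the technique. As one common possibility, the sketch below flags pairs of texts whose word-shingle sets have high Jaccard similarity. The shingle size, the 0.8 threshold, and the sample texts are assumptions for illustration, not details taken from the dissertation.

```python
def shingles(text, n=3):
    """Return the set of n-word shingles of a text (lowercased, whitespace-split)."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def near_duplicates(texts, threshold=0.8):
    """Return index pairs (i, j), i < j, whose shingle sets are similar enough.

    Quadratic in the number of texts; large corpora would typically need an
    approximate method such as MinHash instead of exhaustive comparison.
    """
    sets = [shingles(t) for t in texts]
    return [
        (i, j)
        for i in range(len(texts))
        for j in range(i + 1, len(texts))
        if jaccard(sets[i], sets[j]) >= threshold
    ]


# Made-up sample texts: the second differs from the first by one trailing token.
texts = [
    "the vaccine was approved by the agency last week",
    "the vaccine was approved by the agency last week !",
    "rain is expected in the city tomorrow",
]
pairs = near_duplicates(texts)
print(pairs)
```

Removing one item from each flagged pair before training keeps near-identical articles from leaking between train and test splits.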
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- South America > Brazil > Rio Grande do Sul > Porto Alegre (0.04)
- (31 more...)
- Research Report (0.70)
- Overview (0.67)
- Information Technology > Services (1.00)
- Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.94)
- Media > News (0.70)
VLMs as GeoGuessr Masters: Exceptional Performance, Hidden Biases, and Privacy Risks
Huang, Jingyuan, Huang, Jen-tse, Liu, Ziyi, Liu, Xiaoyuan, Wang, Wenxuan, Zhao, Jieyu
Visual-Language Models (VLMs) have shown remarkable performance across various tasks, particularly in recognizing geographic information from images. However, significant challenges remain, including biases and privacy concerns. To systematically address these issues in the context of geographic information recognition, we introduce a benchmark dataset consisting of 1,200 images paired with detailed geographic metadata. Evaluating four VLMs, we find that while these models demonstrate the ability to recognize geographic information from images, achieving up to 53.8% accuracy in city prediction, they exhibit significant regional biases. Specifically, performance is substantially higher for economically developed and densely populated regions compared to less developed (-12.5%) and sparsely populated (-17.0%) areas. Moreover, the models frequently overpredict certain locations; for instance, they consistently predict Sydney for images taken in Australia. The strong performance of VLMs also raises privacy concerns, particularly for users who share images online without the intent of being identified. Our code and dataset are publicly available at https://github.com/uscnlp-lime/FairLocator.
- North America > United States > California > Los Angeles County > Los Angeles (0.15)
- Asia > India > Karnataka > Bengaluru (0.14)
- North America > United States > New York (0.06)
- (74 more...)
Large Language Model for Qualitative Research -- A Systematic Mapping Study
Barros, Cauã Ferreira, Azevedo, Bruna Borges, Neto, Valdemar Vicente Graciano, Kassab, Mohamad, Kalinowski, Marcos, Nascimento, Hugo Alexandre D. do, Bandeira, Michelle C. G. S. P.
The exponential growth of text-based data in domains such as healthcare, education, and social sciences has outpaced the capacity of traditional qualitative analysis methods, which are time-intensive and prone to subjectivity. Large Language Models (LLMs), powered by advanced generative AI, have emerged as transformative tools capable of automating and enhancing qualitative analysis. This study systematically maps the literature on the use of LLMs for qualitative research, exploring their application contexts, configurations, methodologies, and evaluation metrics. Findings reveal that LLMs are utilized across diverse fields, demonstrating the potential to automate processes traditionally requiring extensive human input. However, challenges such as reliance on prompt engineering, occasional inaccuracies, and contextual limitations remain significant barriers. This research highlights opportunities for integrating LLMs with human expertise, improving model robustness, and refining evaluation methodologies. By synthesizing trends and identifying research gaps, this study aims to guide future innovations in the application of LLMs for qualitative analysis.
- South America > Brazil > Goiás > Goiânia (0.05)
- South America > Brazil > Rio de Janeiro > Rio de Janeiro (0.04)
- South America > Brazil > Santa Catarina > Florianópolis (0.04)
- (3 more...)
- Overview (1.00)
- Research Report > Experimental Study (0.46)
Fair Railway Network Design
He, Zixu, Botan, Sirin, Lang, Jérôme, Saffidine, Abdallah, Sikora, Florian, Workman, Silas
When designing a public transportation network in a country, one may want to minimise the sum of travel duration of all inhabitants. This corresponds to a purely utilitarian view and does not involve any fairness consideration, as the resulting network will typically benefit the capital city and/or large central cities while leaving some peripheral cities behind. On the other hand, a more egalitarian view will allow some people to travel between peripheral cities without having to go through a central city. We define a model, propose algorithms for computing solution networks, and report on experiments based on real data.
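The contrast between the utilitarian and egalitarian views can be made concrete with a toy example. The sketch below is not the paper's model: the three cities, populations, travel durations, the population-product demand weighting, and the choice of "worst pairwise duration" as the egalitarian objective are all assumptions chosen for illustration.

```python
from itertools import combinations

INF = float("inf")


def shortest_durations(cities, edges):
    """All-pairs shortest travel durations via Floyd-Warshall.

    `edges` maps an undirected pair (u, v) to a travel duration.
    """
    dist = {u: {v: (0 if u == v else INF) for v in cities} for u in cities}
    for (u, v), t in edges.items():
        dist[u][v] = dist[v][u] = min(dist[u][v], t)
    for k in cities:
        for i in cities:
            for j in cities:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


def utilitarian_cost(dist, population):
    """Total travel duration, weighting each city pair by an assumed demand
    proportional to the product of the two populations."""
    return sum(
        population[u] * population[v] * dist[u][v]
        for u, v in combinations(population, 2)
    )


def egalitarian_cost(dist, population):
    """Worst travel duration over all city pairs (one possible fairness measure)."""
    return max(dist[u][v] for u, v in combinations(population, 2))


# Hypothetical country: a large capital and two small peripheral cities.
population = {"Capital": 10, "East": 1, "West": 1}

# Two candidate networks with two links each.
star = {("Capital", "East"): 2, ("Capital", "West"): 2}   # everything via the capital
chain = {("Capital", "East"): 2, ("East", "West"): 1}     # direct peripheral link

for name, edges in [("star", star), ("chain", chain)]:
    d = shortest_durations(population, edges)
    print(name, utilitarian_cost(d, population), egalitarian_cost(d, population))
```

Under these assumed numbers the star network has the lower weighted total, while the chain has the lower worst-case duration, which mirrors the tension the abstract describes between serving central cities and connecting peripheral ones directly.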
- North America > United States > Illinois > Cook County > Chicago (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Europe > France > Pays de la Loire > Loire-Atlantique > Nantes (0.04)
- (64 more...)
Emotion Talk: Emotional Support via Audio Messages for Psychological Assistance
Almada, Fabrycio Leite Nakano, Mariano, Kauan Divino Pouso, Dutra, Maykon Adriell, Monteiro, Victor Emanuel da Silva
This paper presents "Emotion Talk," a system designed to provide continuous emotional support through audio messages for psychological assistance. The primary objective is to offer consistent support to patients outside traditional therapy sessions by analyzing audio messages to detect emotions and generate appropriate responses. The solution focuses on Portuguese-speaking users, ensuring that the system is linguistically and culturally relevant. This system aims to complement and enhance the psychological follow-up process conducted by therapists, providing immediate and accessible assistance, especially in emergency situations where rapid response is crucial. Experimental results demonstrate the effectiveness of the proposed system, highlighting its potential in applications of psychological support.
- South America > Brazil > Goiás > Goiânia (0.05)
- South America > Brazil > Paraíba > João Pessoa (0.04)
Evaluating Voice Command Pipelines for Drone Control: From STT and LLM to Direct Classification and Siamese Networks
Simões, Lucca Emmanuel Pineli, Rodrigues, Lucas Brandão, Silva, Rafaela Mota, da Silva, Gustavo Rodrigues
The integration of automation and voice control in drone systems has received significant attention in recent research, driven by the need for more intuitive and efficient human-machine interaction [4, 1]. This project focuses on developing a voice command system for the Tello drone, utilizing speech recognition and deep learning models to translate voice commands into precise drone actions. The primary challenge addressed by this project is the accurate and efficient translation of voice commands into specific drone operations. This is particularly crucial in scenarios where traditional control interfaces are impractical or where operators require hands-free operation [10, 5]. To address this challenge, we developed and evaluated three distinct pipelines. The first pipeline uses a traditional Speech-to-Text (STT) model followed by a Large Language Model (LLM) for command interpretation [11]. The second pipeline involves a direct mapping model that predicts drone commands from audio inputs without intermediate text conversion. The third pipeline employs a Siamese neural network to generalize new commands by comparing audio inputs to pre-trained examples [8]. Each pipeline was designed to balance performance, flexibility, and ease of maintenance.
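The third pipeline's idea of comparing an audio input to stored examples can be sketched as a nearest-reference lookup over embeddings. The sketch assumes a trained Siamese encoder already maps audio to fixed-length vectors; the three-dimensional vectors, command labels, cosine similarity, and the 0.7 rejection threshold below are illustrative stand-ins, not values from the project.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(query, references, threshold=0.7):
    """Return the label of the most similar reference embedding,
    or None when no reference is similar enough."""
    best_label, best_score = None, -1.0
    for label, embedding in references.items():
        score = cosine(query, embedding)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else None


# Stand-in embeddings; a real system would obtain these from the encoder.
references = {
    "takeoff": [1.0, 0.0, 0.0],
    "land": [0.0, 1.0, 0.0],
}

print(classify([0.9, 0.1, 0.0], references))   # close to the "takeoff" example
print(classify([0.5, 0.5, 0.5], references))   # ambiguous input is rejected

# Supporting a new command only requires storing one more reference example.
references["flip"] = [0.0, 0.0, 1.0]
print(classify([0.1, 0.0, 0.95], references))
```

Because classification reduces to comparison against stored examples, new commands can be added without retraining a fixed-output classifier, which is the flexibility the abstract attributes to this pipeline.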
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
- Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles > Drones (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Aprendizado de máquina aplicado na eletroquímica (Machine Learning Applied to Electrochemistry)
Araújo, Carlos Eduardo do Egito, Sgobbi, Lívia F., Sene, Iwens Gervasio Jr, de Carvalho, Sergio Teixeira
This systematic review focuses on analyzing the use of machine learning techniques for identifying and quantifying analytes in various electrochemical applications, presenting the available applications in the literature. Machine learning is a tool that can facilitate the analysis and enhance the understanding of processes involving various analytes. In electrochemical biosensors, it increases the precision of medical diagnostics, improving the identification of biomarkers and pathogens with high reliability. It can be effectively used for the classification of complex chemical products; in environmental monitoring, using low-cost sensors; in portable devices and wearable systems; among others. Currently, the analysis of some analytes is still performed manually, requiring the expertise of a specialist in the field and thus hindering the generalization of results. In light of the advancements in artificial intelligence today, this work proposes to carry out a systematic review of the literature on the applications of artificial intelligence techniques. A set of articles has been identified that address electrochemical problems using machine learning techniques, more specifically, supervised learning.
- South America > Brazil > Goiás > Goiânia (0.04)
- South America > Brazil > Minas Gerais > Itajubá (0.04)
- Asia > China (0.04)
Emissions Reporting Maturity Model: supporting cities to leverage emissions-related processes through performance indicators and artificial intelligence
Xavier, Victor de A., França, Felipe M. G., Lima, Priscila M. V.
Climate change and global warming have been trending topics worldwide since the Eco-92 conference. However, little progress has been made in reducing greenhouse gases (GHGs). The problems and challenges related to emissions are complex and require a concerted and comprehensive effort to address them. Emissions reporting is a critical component of GHG reduction policy and is therefore the focus of this work. The main goal of this work is two-fold: (i) to propose an emissions reporting evaluation model to improve the overall quality of emissions reporting and (ii) to use artificial intelligence (AI) to support the initiatives that improve emissions reporting. Thus, this work presents an Emissions Reporting Maturity Model (ERMM) for examining, clustering, and analysing data from emissions reporting initiatives to help cities deal with climate change and global warming challenges. The Performance Indicator Development Process (PIDP) proposed in this work provides ways to improve the quality of the available data necessary for the execution of the evaluations identified by the ERMM. Hence, the PIDP supports the preparation of the data from emissions-related databases, the classification of the data according to similarities highlighted by different clustering techniques, and the identification of performance indicator candidates, which are strengthened by a qualitative analysis of selected data samples. Thus, the main goal of the ERMM is to evaluate and classify cities with regard to their emissions reporting processes, pointing out the drawbacks and challenges faced by other cities from different contexts, and ultimately to help them improve the underlying emissions-related processes and emissions mitigation initiatives.
- South America > Brazil > Rio de Janeiro > Rio de Janeiro (0.14)
- Europe > Italy (0.14)
- South America > Brazil > São Paulo (0.04)
- (12 more...)
- Government (1.00)
- Law > Environmental Law (0.93)
- Banking & Finance (0.67)
- Energy (0.67)
WaterBench: Towards Holistic Evaluation of Watermarks for Large Language Models
Tu, Shangqing, Sun, Yuliang, Bai, Yushi, Yu, Jifan, Hou, Lei, Li, Juanzi
To mitigate the potential misuse of large language models (LLMs), recent research has developed watermarking algorithms, which restrict the generation process to leave an invisible trace for watermark detection. Due to the two-stage nature of the task, most studies evaluate the generation and detection separately, thereby presenting a challenge in unbiased, thorough, and applicable evaluations. In this paper, we introduce WaterBench, the first comprehensive benchmark for LLM watermarks, in which we design three crucial factors: (1) For benchmarking procedure, to ensure an apples-to-apples comparison, we first adjust each watermarking method's hyper-parameter to reach the same watermarking strength, then jointly evaluate their generation and detection performance. (2) For task selection, we diversify the input and output length to form a five-category taxonomy, covering 9 tasks. (3) For evaluation metric, we adopt the GPT4-Judge for automatically evaluating the decline of instruction-following abilities after watermarking. We evaluate 4 open-source watermarks on 2 LLMs under 2 watermarking strengths and observe the common struggles for current methods on maintaining the generation quality. The code and data are available at https://github.com/THU-KEG/WaterBench.
- Africa > Ghana (0.05)
- Oceania > Australia (0.04)
- North America > United States > Texas (0.04)
- (26 more...)
- Personal (0.92)
- Research Report > New Finding (0.46)
- Materials > Metals & Mining > Gold (1.00)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)